Hemlata Channe

Project 4: Unsupervised Learning - Clustering

Data Description:

The data contains features extracted from the silhouette of vehicles in different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

Domain: Object recognition

Context:

The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:

● All the features are geometric features extracted from the silhouette. ● All are numeric in nature.

Learning Outcomes:

● Exploratory Data Analysis ● Reduce number dimensions in the dataset with minimal information loss ● Train a model using Principle Components

Objective:

Apply dimensionality reduction technique – PCA and train a model using principle components instead of training the model using just the raw data.

In [1]:
import numpy as np   
from sklearn.linear_model import LinearRegression
import pandas as pd    
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns
from scipy.stats import zscore

from sklearn.decomposition import PCA
from sklearn import svm

Task 1: Data pre-processing – Perform all the necessary preprocessing on the data ready to be fed to an Unsupervised algorithm (10 marks)

In [2]:
Data = pd.read_csv("vehicle-1.csv")
Data.head()
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [3]:
Data.describe().transpose
Out[3]:
<bound method DataFrame.transpose of        compactness  circularity  distance_circularity  radius_ratio  \
count   846.000000   841.000000            842.000000    840.000000   
mean     93.678487    44.828775             82.110451    168.888095   
std       8.234474     6.152172             15.778292     33.520198   
min      73.000000    33.000000             40.000000    104.000000   
25%      87.000000    40.000000             70.000000    141.000000   
50%      93.000000    44.000000             80.000000    167.000000   
75%     100.000000    49.000000             98.000000    195.000000   
max     119.000000    59.000000            112.000000    333.000000   

       pr.axis_aspect_ratio  max.length_aspect_ratio  scatter_ratio  \
count            844.000000               846.000000     845.000000   
mean              61.678910                 8.567376     168.901775   
std                7.891463                 4.601217      33.214848   
min               47.000000                 2.000000     112.000000   
25%               57.000000                 7.000000     147.000000   
50%               61.000000                 8.000000     157.000000   
75%               65.000000                10.000000     198.000000   
max              138.000000                55.000000     265.000000   

       elongatedness  pr.axis_rectangularity  max.length_rectangularity  \
count     845.000000              843.000000                 846.000000   
mean       40.933728               20.582444                 147.998818   
std         7.816186                2.592933                  14.515652   
min        26.000000               17.000000                 118.000000   
25%        33.000000               19.000000                 137.000000   
50%        43.000000               20.000000                 146.000000   
75%        46.000000               23.000000                 159.000000   
max        61.000000               29.000000                 188.000000   

       scaled_variance  scaled_variance.1  scaled_radius_of_gyration  \
count       843.000000         844.000000                 844.000000   
mean        188.631079         439.494076                 174.709716   
std          31.411004         176.666903                  32.584808   
min         130.000000         184.000000                 109.000000   
25%         167.000000         318.000000                 149.000000   
50%         179.000000         363.500000                 173.500000   
75%         217.000000         587.000000                 198.000000   
max         320.000000        1018.000000                 268.000000   

       scaled_radius_of_gyration.1  skewness_about  skewness_about.1  \
count                   842.000000      840.000000        845.000000   
mean                     72.447743        6.364286         12.602367   
std                       7.486190        4.920649          8.936081   
min                      59.000000        0.000000          0.000000   
25%                      67.000000        2.000000          5.000000   
50%                      71.500000        6.000000         11.000000   
75%                      75.000000        9.000000         19.000000   
max                     135.000000       22.000000         41.000000   

       skewness_about.2  hollows_ratio  
count        845.000000     846.000000  
mean         188.919527     195.632388  
std            6.155809       7.438797  
min          176.000000     181.000000  
25%          184.000000     190.250000  
50%          188.000000     197.000000  
75%          193.000000     201.000000  
max          206.000000     211.000000  >
In [4]:
Data.columns
Out[4]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')
In [5]:
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [6]:
Data.shape
Out[6]:
(846, 19)

Findings from column description

  1. There are 19 columns with 846 rows in the dataset.
  2. All the columns are numeric
  3. The dataset have few null values in many columns 'circularity', 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio','scatter_ratio','elongatedness','pr.axis_rectangularity','scaled_variance',scaled_variance.1','scaled_radius_of_gyration','scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2'

Preprocessing of Dataset

Instead of dropping the rows, replacing the missing values with median value.

In [7]:
Data.median()
Out[7]:
compactness                     93.0
circularity                     44.0
distance_circularity            80.0
radius_ratio                   167.0
pr.axis_aspect_ratio            61.0
max.length_aspect_ratio          8.0
scatter_ratio                  157.0
elongatedness                   43.0
pr.axis_rectangularity          20.0
max.length_rectangularity      146.0
scaled_variance                179.0
scaled_variance.1              363.5
scaled_radius_of_gyration      173.5
scaled_radius_of_gyration.1     71.5
skewness_about                   6.0
skewness_about.1                11.0
skewness_about.2               188.0
hollows_ratio                  197.0
dtype: float64
In [8]:
# replace the missing values with median value.
# Note, we do not need to specify the column names below
# every column's missing value is replaced with that column's median respectively  (axis =0 means columnwise)
#Data = Data.fillna(Data.median()) 

medianFiller = lambda x: x.fillna(x.median() if x.name!='class' else x)
Data = Data.apply(medianFiller,axis=0)
In [9]:
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  846 non-null    float64
 2   distance_circularity         846 non-null    float64
 3   radius_ratio                 846 non-null    float64
 4   pr.axis_aspect_ratio         846 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                846 non-null    float64
 7   elongatedness                846 non-null    float64
 8   pr.axis_rectangularity       846 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              846 non-null    float64
 11  scaled_variance.1            846 non-null    float64
 12  scaled_radius_of_gyration    846 non-null    float64
 13  scaled_radius_of_gyration.1  846 non-null    float64
 14  skewness_about               846 non-null    float64
 15  skewness_about.1             846 non-null    float64
 16  skewness_about.2             846 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [10]:
X = Data.drop(labels="class" , axis = 1)
y = Data["class"]
X.head()
Out[10]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183

Performing scaling on dataset using Standardscalar

In [11]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

Task2: Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why (10 points)

Relationship between the attributes

In [36]:
Data.columns
Out[36]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')

Covariance of all variables

The column 'compactness' and 'pr.axis_aspect_ratio' have small covariance as well as correlation. The column 'compactness' and the columns 'distance_circularity', 'scatter_ratio','elongatedness', 'pr.axis_rectangularity', 'scaled_variance', 'scaled_variance.1'are highly correlated and covariance is also high above 0.75. Also other columns have very small covariance which we can observe in the below code output.

In [38]:
cv = Data.cov()
In [45]:
sorted_cv = cv.sort_values(by='compactness', axis=0, ascending=False)
sorted_cv 
Out[45]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
scaled_variance.1 1183.048547 905.058993 2461.647662 4235.335892 124.077322 116.335595 5818.327065 -1315.091741 451.485940 2035.770195 5234.325701 31150.958419 4566.814765 -20.301074 66.529926 316.534052 6.752894 135.145899
scatter_ratio 222.142552 172.677241 472.978160 814.370529 27.143602 25.385681 1102.087967 -251.971673 85.053415 389.886258 987.646992 5818.327065 864.233907 -6.828864 12.119955 62.982302 1.149410 29.342103
scaled_variance 196.794514 153.188097 425.299717 831.086715 67.460309 46.024231 987.646992 -229.398540 75.838946 339.129701 983.476393 5234.325701 795.013063 26.485388 5.647740 54.402084 2.743418 19.991295
radius_ratio 189.708781 127.220510 403.298994 1115.650555 174.669579 69.167032 814.370529 -205.997356 61.247953 275.850739 831.086715 4235.335892 583.084529 -45.002975 7.977918 51.827988 78.542431 117.104181
scaled_radius_of_gyration 156.845875 184.837067 361.587476 583.084529 31.289907 28.414449 864.233907 -194.833526 67.119448 409.337523 795.013063 4566.814765 1059.260119 46.543111 26.567695 -16.321991 -44.942280 -28.568838
distance_circularity 102.393288 76.508841 247.796994 403.298994 19.660863 19.171329 472.978160 -112.064585 36.388956 176.978817 425.299717 2461.647662 361.587476 -26.564115 8.793202 37.332497 14.149033 38.962423
max.length_rectangularity 80.818554 85.598608 176.978817 275.850739 14.520328 20.433808 389.886258 -87.977590 30.470509 210.704141 339.129701 2035.770195 409.337523 4.512359 9.669067 0.177042 -9.282937 8.289506
compactness 67.806566 34.595378 102.393288 189.708781 5.941097 5.616954 222.142552 -50.737706 17.344216 80.818554 196.794514 1183.048547 156.845875 -15.350216 9.531814 11.547134 15.124042 22.391727
circularity 34.595378 37.629299 76.508841 127.220510 7.435406 7.097679 172.677241 -39.365101 13.392280 85.598608 153.188097 905.058993 184.837067 2.379936 4.337152 -0.626662 -3.941009 2.115060
hollows_ratio 22.391727 2.115060 38.962423 117.104181 15.697801 4.925981 29.342103 -12.604557 1.911832 8.289506 19.991295 135.145899 -28.568838 -44.564669 3.542591 13.618636 40.849272 55.335707
pr.axis_rectangularity 17.344216 13.392280 36.388956 61.247953 1.624193 1.923572 85.053415 -19.190130 6.700632 30.470509 75.838946 451.485940 67.119448 -0.299576 1.063200 4.963512 -0.296987 1.911832
skewness_about.2 15.124042 -3.941009 14.149033 78.542431 11.632821 -0.738285 1.149410 -5.533023 -0.296987 -9.282937 2.743418 6.752894 -44.942280 -34.409958 3.478056 4.247849 37.850145 40.849272
skewness_about.1 11.547134 -0.626662 37.332497 51.827988 -2.250972 1.784347 62.982302 -12.910739 4.963512 0.177042 54.402084 316.534052 -16.321991 -8.416778 -1.532242 79.762083 4.247849 13.618636
skewness_about 9.531814 4.337152 8.793202 7.977918 -2.255923 0.351933 12.119955 -2.014755 1.063200 9.669067 5.647740 66.529926 26.567695 -3.235667 24.041798 -1.532242 3.478056 3.542591
pr.axis_aspect_ratio 5.941097 7.435406 19.660863 174.669579 62.128881 23.527685 27.143602 -11.270326 1.624193 14.520328 67.460309 124.077322 31.289907 9.004155 -2.255923 -2.250972 11.632821 15.697801
max.length_aspect_ratio 5.616954 7.097679 19.171329 69.167032 23.527685 21.171195 25.385681 -6.474984 1.923572 20.433808 46.024231 116.335595 28.414449 10.162999 0.351933 1.784347 -0.738285 4.925981
scaled_radius_of_gyration.1 -15.350216 2.379936 -26.564115 -45.002975 9.004155 10.162999 -6.828864 6.027143 -0.299576 4.512359 26.485388 -20.301074 46.543111 55.781984 -3.235667 -8.416778 -34.409958 -44.564669
elongatedness -50.737706 -39.365101 -112.064585 -205.997356 -11.270326 -6.474984 -251.971673 61.025507 -19.190130 -87.977590 -229.398540 -1315.091741 -194.833526 6.027143 -2.014755 -12.910739 -5.533023 -12.604557
In [12]:
covMatrix = np.cov(X,rowvar=False)
print(covMatrix)
[[ 1.00118343  0.68569786  0.79086299  0.69055952  0.09164265  0.14842463
   0.81358214 -0.78968322  0.81465658  0.67694334  0.76297234  0.81497566
   0.58593517 -0.24988794  0.23635777  0.15720044  0.29889034  0.36598446]
 [ 0.68569786  1.00118343  0.79325751  0.6216467   0.15396023  0.25176438
   0.8489411  -0.82244387  0.84439802  0.96245572  0.79724837  0.83693508
   0.92691166  0.05200785  0.14436828 -0.01145212 -0.10455005  0.04640562]
 [ 0.79086299  0.79325751  1.00118343  0.76794246  0.15864319  0.26499957
   0.90614687 -0.9123854   0.89408198  0.77544391  0.86253904  0.88706577
   0.70660663 -0.22621115  0.1140589   0.26586088  0.14627113  0.33312625]
 [ 0.69055952  0.6216467   0.76794246  1.00118343  0.66423242  0.45058426
   0.73529816 -0.79041561  0.70922371  0.56962256  0.79435372  0.71928618
   0.53700678 -0.18061084  0.04877032  0.17394649  0.38266622  0.47186659]
 [ 0.09164265  0.15396023  0.15864319  0.66423242  1.00118343  0.64949139
   0.10385472 -0.18325156  0.07969786  0.1270594   0.27323306  0.08929427
   0.12211524  0.15313091 -0.05843967 -0.0320139   0.24016968  0.26804208]
 [ 0.14842463  0.25176438  0.26499957  0.45058426  0.64949139  1.00118343
   0.16638787 -0.18035326  0.16169312  0.30630475  0.31933428  0.1434227
   0.18996732  0.29608463  0.01561769  0.04347324 -0.02611148  0.14408905]
 [ 0.81358214  0.8489411   0.90614687  0.73529816  0.10385472  0.16638787
   1.00118343 -0.97275069  0.99092181  0.81004084  0.94978498  0.9941867
   0.80082111 -0.02757446  0.07454578  0.21267959  0.00563439  0.1189581 ]
 [-0.78968322 -0.82244387 -0.9123854  -0.79041561 -0.18325156 -0.18035326
  -0.97275069  1.00118343 -0.95011894 -0.77677186 -0.93748998 -0.95494487
  -0.76722075  0.10342428 -0.05266193 -0.18527244 -0.11526213 -0.2171615 ]
 [ 0.81465658  0.84439802  0.89408198  0.70922371  0.07969786  0.16169312
   0.99092181 -0.95011894  1.00118343  0.81189327  0.93533261  0.98938264
   0.79763248 -0.01551372  0.08386628  0.21495454 -0.01867064  0.09940372]
 [ 0.67694334  0.96245572  0.77544391  0.56962256  0.1270594   0.30630475
   0.81004084 -0.77677186  0.81189327  1.00118343  0.74586628  0.79555492
   0.86747579  0.04167099  0.13601231  0.00136727 -0.10407076  0.07686047]
 [ 0.76297234  0.79724837  0.86253904  0.79435372  0.27323306  0.31933428
   0.94978498 -0.93748998  0.93533261  0.74586628  1.00118343  0.94679667
   0.77983844  0.11321163  0.03677248  0.19446837  0.01423606  0.08579656]
 [ 0.81497566  0.83693508  0.88706577  0.71928618  0.08929427  0.1434227
   0.9941867  -0.95494487  0.98938264  0.79555492  0.94679667  1.00118343
   0.79595778 -0.01541878  0.07696823  0.20104818  0.00622636  0.10305714]
 [ 0.58593517  0.92691166  0.70660663  0.53700678  0.12211524  0.18996732
   0.80082111 -0.76722075  0.79763248  0.86747579  0.77983844  0.79595778
   1.00118343  0.19169941  0.16667971 -0.05621953 -0.22471583 -0.11814142]
 [-0.24988794  0.05200785 -0.22621115 -0.18061084  0.15313091  0.29608463
  -0.02757446  0.10342428 -0.01551372  0.04167099  0.11321163 -0.01541878
   0.19169941  1.00118343 -0.08846001 -0.12633227 -0.749751   -0.80307227]
 [ 0.23635777  0.14436828  0.1140589   0.04877032 -0.05843967  0.01561769
   0.07454578 -0.05266193  0.08386628  0.13601231  0.03677248  0.07696823
   0.16667971 -0.08846001  1.00118343 -0.03503155  0.1154338   0.09724079]
 [ 0.15720044 -0.01145212  0.26586088  0.17394649 -0.0320139   0.04347324
   0.21267959 -0.18527244  0.21495454  0.00136727  0.19446837  0.20104818
  -0.05621953 -0.12633227 -0.03503155  1.00118343  0.07740174  0.20523257]
 [ 0.29889034 -0.10455005  0.14627113  0.38266622  0.24016968 -0.02611148
   0.00563439 -0.11526213 -0.01867064 -0.10407076  0.01423606  0.00622636
  -0.22471583 -0.749751    0.1154338   0.07740174  1.00118343  0.89363767]
 [ 0.36598446  0.04640562  0.33312625  0.47186659  0.26804208  0.14408905
   0.1189581  -0.2171615   0.09940372  0.07686047  0.08579656  0.10305714
  -0.11814142 -0.80307227  0.09724079  0.20523257  0.89363767  1.00118343]]
In [ ]:
 

Correlation matrix

In [13]:
plt.figure(figsize = (12,12))
def plot_corr(X, size=11):
    corr = X.corr()
    sns.heatmap(corr, annot=True)
plot_corr(Data)
In [14]:
sns.pairplot(Data,diag_kind='kde')
Out[14]:
<seaborn.axisgrid.PairGrid at 0x7f4f706ed450>
In [15]:
columns = list(Data)[0:-1] # Excluding Outcome column 
Data[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); 
In [ ]:
 

Task3: Split the data into train and test (Suggestion: specify “random state” if you are using train_test_split from Sklearn) (5 marks)

In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,stratify = y)
# 42 is just any random seed number
y_train.value_counts()/y_train.shape[0]*100
Out[16]:
car    50.675676
bus    25.844595
van    23.479730
Name: class, dtype: float64
In [17]:
y_train.value_counts().plot(kind='bar'); 
In [18]:
y_test.value_counts()/y_test.shape[0]*100
Out[18]:
car    50.787402
bus    25.590551
van    23.622047
Name: class, dtype: float64
In [19]:
y_test.value_counts().plot(kind='bar'); 
In [ ]:
 

Task4: Train a Support vector machine using the train set and get the accuracy on the test set (10 marks)

In [20]:
svm_classifier = svm.SVC(gamma=0.025, C=3)    
In [21]:
svm_classifier.fit(x_train , y_train)
Out[21]:
SVC(C=3, gamma=0.025)
In [22]:
y_pred = svm_classifier.predict(x_test)
In [23]:
print("Accuracy %0.2f " % (svm_classifier.score(x_train, y_train)))
print("Accuracy %0.2f " % (svm_classifier.score(x_test, y_test)))
Accuracy 0.98 
Accuracy 0.97 
In [ ]:
 
In [ ]:
 

Task5: Perform K-fold cross validation and get the cross validation score of the model (optional)

with k-fold cross validation with s_splits = 10, we can get the accuracy of 0.97 with +/- 0.02 i.e. around 95 %

In [50]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(svm_classifier, X, y, cv=10, scoring='accuracy')
print(scores)
[0.96470588 0.97647059 0.96470588 0.98823529 1.         0.97647059
 0.96428571 0.98809524 0.96428571 0.97619048]
In [57]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.02)
In [59]:
import numpy as np
from sklearn.model_selection import KFold
cv = KFold(n_splits=10, random_state=1, shuffle=True)
In [60]:
cv
Out[60]:
KFold(n_splits=10, random_state=1, shuffle=True)
In [61]:
scores = cross_val_score(svm_classifier, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
In [62]:
scores
Out[62]:
array([0.97647059, 0.96470588, 0.96470588, 0.96470588, 0.98823529,
       0.96470588, 0.98809524, 0.95238095, 0.97619048, 0.97619048])
In [63]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.97 (+/- 0.02)

Task 6: Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data – (10 points)

In [24]:
pca = PCA(n_components=18)
pca.fit(X)
Out[24]:
PCA(n_components=18)
In [25]:
print(pca.explained_variance_)
[9.40460261e+00 3.01492206e+00 1.90352502e+00 1.17993747e+00
 9.17260633e-01 5.39992629e-01 3.58870118e-01 2.21932456e-01
 1.60608597e-01 9.18572234e-02 6.64994118e-02 4.66005994e-02
 3.57947189e-02 2.74120657e-02 2.05792871e-02 1.79166314e-02
 1.00257898e-02 2.96445743e-03]
In [26]:
print(pca.components_)
[[ 2.75283688e-01  2.93258469e-01  3.04609128e-01  2.67606877e-01
   8.05039890e-02  9.72756855e-02  3.17092750e-01 -3.14133155e-01
   3.13959064e-01  2.82830900e-01  3.09280359e-01  3.13788457e-01
   2.72047492e-01 -2.08137692e-02  4.14555082e-02  5.82250207e-02
   3.02795063e-02  7.41453913e-02]
 [-1.26953763e-01  1.25576727e-01 -7.29516436e-02 -1.89634378e-01
  -1.22174860e-01  1.07482875e-02  4.81181371e-02  1.27498515e-02
   5.99352482e-02  1.16220532e-01  6.22806229e-02  5.37843596e-02
   2.09233172e-01  4.88525148e-01 -5.50899716e-02 -1.24085090e-01
  -5.40914775e-01 -5.40354258e-01]
 [-1.19922479e-01 -2.48205467e-02 -5.60143254e-02  2.75074211e-01
   6.42012966e-01  5.91801304e-01 -9.76283108e-02  5.76484384e-02
  -1.09512416e-01 -1.70641987e-02  5.63239801e-02 -1.08840729e-01
  -3.14636493e-02  2.86277015e-01 -1.15679354e-01 -7.52828901e-02
   8.73592034e-03  3.95242743e-02]
 [ 7.83843562e-02  1.87337408e-01 -7.12008427e-02 -4.26053415e-02
   3.27257119e-02  3.14147277e-02 -9.57485748e-02  8.22901952e-02
  -9.24582989e-02  1.88005612e-01 -1.19844008e-01 -9.17449325e-02
   2.00095228e-01 -6.55051354e-02  6.04794251e-01 -6.66114117e-01
   1.05526253e-01  4.74890311e-02]
 [ 6.95178336e-02 -8.50649539e-02  4.06645651e-02 -4.61473714e-02
  -4.05494487e-02  2.13432566e-01 -1.54853055e-02  7.68518712e-02
   2.17633157e-03 -6.06366845e-02 -4.56472367e-04 -1.95548315e-02
  -6.15991681e-02  1.45530146e-01  7.29189842e-01  5.99196401e-01
  -1.00602332e-01 -2.98614819e-02]
 [ 1.44875476e-01 -3.02731148e-01 -1.38405773e-01  2.48136636e-01
   2.36932611e-01 -4.19330747e-01  1.16100153e-01 -1.41840112e-01
   9.80561329e-02 -4.61674972e-01  2.36225434e-01  1.57820194e-01
  -1.35576278e-01  2.41356821e-01  2.03209257e-01 -1.91960802e-01
   1.56939174e-01 -2.41222817e-01]
 [ 4.51862331e-01 -2.49103387e-01  7.40350569e-02 -1.76912814e-01
  -3.97876601e-01  5.03413610e-01  6.49879382e-02  1.38112945e-02
   9.66573058e-02 -1.04552173e-01  1.14622578e-01  8.37350220e-02
  -3.73944382e-01  1.11952983e-01 -8.06328902e-02 -2.84558723e-01
   1.81451818e-02  1.57237839e-02]
 [-5.66136785e-01 -1.79851809e-01  4.34748988e-01  1.01998360e-01
  -6.87147927e-02  1.61153097e-01  1.00688056e-01 -2.15497166e-01
   6.35933915e-02 -2.49495867e-01  5.02096319e-02  4.37649907e-02
  -1.08474496e-01 -3.40878491e-01  1.56487670e-01 -2.08774083e-01
  -3.04580219e-01 -3.04186304e-02]
 [-4.84418105e-01 -1.41569001e-02 -1.67572478e-01 -2.30313563e-01
  -2.77128307e-01  1.48032250e-01  5.44574214e-02 -1.56867362e-01
   5.24978759e-03 -6.10362445e-02  2.97588112e-01  8.33669838e-02
   2.41655483e-01  3.20221887e-01  2.21054148e-02  1.01761758e-02
   5.17222779e-01  1.71506343e-01]
 [-2.60076393e-01  9.80779086e-02 -2.05031597e-01 -4.77888949e-02
   1.08075009e-01 -1.18266345e-01  1.65167200e-01 -1.51612333e-01
   1.93777917e-01  4.69059999e-01 -1.29986011e-01  1.58203940e-01
  -6.86493700e-01  1.27648385e-01  9.83643219e-02 -3.55150608e-02
   1.93956186e-02  6.41314778e-02]
 [ 4.65342885e-02  3.01323693e-03  7.06489498e-01 -1.07151583e-01
   3.85169721e-02 -2.62254132e-01 -1.70405800e-01 -5.76632611e-02
  -2.72514033e-01  1.41434233e-01  7.72596638e-02 -2.43226301e-01
  -1.58888394e-01  4.19188664e-01 -1.25447648e-02 -3.27808069e-02
   1.20597635e-01  9.19597847e-02]
 [ 1.20344026e-02 -2.13635088e-01  3.46330345e-04 -1.57049977e-01
   1.10106595e-01 -1.32935328e-01  9.55883216e-02  1.22012715e-01
   2.51281206e-01 -1.24529334e-01 -2.15011644e-01  1.75685051e-01
   1.90336498e-01  2.85710601e-01 -1.60327156e-03 -8.32589542e-02
  -3.53723696e-01  6.85618161e-01]
 [ 1.56136836e-01  1.50116709e-02 -2.37111452e-01 -3.07818692e-02
  -3.92804479e-02  3.72884301e-02  3.94638419e-02 -8.10394855e-01
  -2.71573184e-01 -7.57105808e-02 -1.53180808e-01 -3.07948154e-01
   3.76087492e-02  4.34650674e-02  9.94304634e-03  2.68915150e-02
  -1.86595152e-01  1.42380007e-01]
 [-6.00485194e-02  4.26993118e-01 -1.46240270e-01  5.21374718e-01
  -3.63120360e-01 -6.27796802e-02 -6.40502241e-02  1.86946145e-01
  -1.80912790e-01 -1.74070296e-01  2.77272123e-01 -7.85141734e-02
  -2.00683948e-01  1.46861607e-01  1.73360301e-02 -3.13689218e-02
  -2.31451048e-01  2.88502234e-01]
 [-9.67780251e-03 -5.97862837e-01 -1.57257142e-01  1.66551725e-01
  -6.36138719e-02 -8.63169844e-02 -7.98693109e-02  4.21515054e-02
  -1.44490635e-01  5.11259153e-01  4.53236855e-01 -1.26992250e-01
   1.09982525e-01 -1.11271959e-01  2.40943096e-02 -9.89651885e-03
  -1.82212045e-01  9.04014702e-02]
 [-6.50956666e-02 -2.61244802e-01  7.82651714e-02  5.60792139e-01
  -3.22276873e-01  4.87809642e-02  1.81839668e-02 -2.50330194e-02
   1.64490784e-01  1.47280090e-01 -5.64444637e-01 -6.85856929e-02
   1.47099233e-01  2.32941262e-01 -2.77589170e-02  2.78187408e-03
   1.90629960e-01 -1.20966490e-01]
 [ 6.00532537e-03 -7.38059396e-02  2.50791236e-02  3.59880417e-02
  -1.25847434e-02  2.84168792e-02  2.49652703e-01  4.21478467e-02
  -7.17396292e-01  4.70233017e-02 -1.71503771e-01  6.16589383e-01
   2.64910290e-02  1.42959461e-02 -1.74310271e-03  7.08894692e-03
  -7.67874680e-03 -6.37681817e-03]
 [-1.00728764e-02 -9.15939674e-03  6.94599696e-03 -4.20156482e-02
   3.12698087e-02 -9.99915816e-03  8.40975659e-01  2.38188639e-01
  -1.01154594e-01 -1.69481636e-02  6.04665108e-03 -4.69202757e-01
   1.17483082e-02  3.14812146e-03 -3.03156233e-03 -1.25315953e-02
   4.34282436e-02 -6.47700819e-03]]
In [ ]:
 
In [27]:
print(pca.explained_variance_ratio_)
[5.21860337e-01 1.67297684e-01 1.05626388e-01 6.54745969e-02
 5.08986889e-02 2.99641300e-02 1.99136623e-02 1.23150069e-02
 8.91215289e-03 5.09714695e-03 3.69004485e-03 2.58586200e-03
 1.98624491e-03 1.52109243e-03 1.14194232e-03 9.94191854e-04
 5.56329946e-04 1.64497408e-04]
In [28]:
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
In [29]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('eigen Value')
plt.show()

Dimensionality Reduction

Now 7 dimensions seems very reasonable. With 7 variables we can explain over 95% of the variation in the original data!

In [30]:
pca7 = PCA(n_components=7)
pca7.fit(X)
print(pca7.components_)
print(pca7.explained_variance_ratio_)
Xpca7 = pca7.transform(X)
[[ 2.75283688e-01  2.93258469e-01  3.04609128e-01  2.67606877e-01
   8.05039890e-02  9.72756855e-02  3.17092750e-01 -3.14133155e-01
   3.13959064e-01  2.82830900e-01  3.09280359e-01  3.13788457e-01
   2.72047492e-01 -2.08137692e-02  4.14555082e-02  5.82250207e-02
   3.02795063e-02  7.41453913e-02]
 [-1.26953763e-01  1.25576727e-01 -7.29516436e-02 -1.89634378e-01
  -1.22174860e-01  1.07482875e-02  4.81181371e-02  1.27498515e-02
   5.99352482e-02  1.16220532e-01  6.22806229e-02  5.37843596e-02
   2.09233172e-01  4.88525148e-01 -5.50899716e-02 -1.24085090e-01
  -5.40914775e-01 -5.40354258e-01]
 [-1.19922479e-01 -2.48205467e-02 -5.60143254e-02  2.75074211e-01
   6.42012966e-01  5.91801304e-01 -9.76283108e-02  5.76484384e-02
  -1.09512416e-01 -1.70641987e-02  5.63239801e-02 -1.08840729e-01
  -3.14636493e-02  2.86277015e-01 -1.15679354e-01 -7.52828901e-02
   8.73592034e-03  3.95242743e-02]
 [ 7.83843562e-02  1.87337408e-01 -7.12008427e-02 -4.26053415e-02
   3.27257119e-02  3.14147277e-02 -9.57485748e-02  8.22901952e-02
  -9.24582989e-02  1.88005612e-01 -1.19844008e-01 -9.17449325e-02
   2.00095228e-01 -6.55051354e-02  6.04794251e-01 -6.66114117e-01
   1.05526253e-01  4.74890311e-02]
 [ 6.95178336e-02 -8.50649539e-02  4.06645651e-02 -4.61473714e-02
  -4.05494487e-02  2.13432566e-01 -1.54853055e-02  7.68518712e-02
   2.17633157e-03 -6.06366845e-02 -4.56472367e-04 -1.95548315e-02
  -6.15991681e-02  1.45530146e-01  7.29189842e-01  5.99196401e-01
  -1.00602332e-01 -2.98614819e-02]
 [ 1.44875476e-01 -3.02731148e-01 -1.38405773e-01  2.48136636e-01
   2.36932611e-01 -4.19330747e-01  1.16100153e-01 -1.41840112e-01
   9.80561329e-02 -4.61674972e-01  2.36225434e-01  1.57820194e-01
  -1.35576278e-01  2.41356821e-01  2.03209257e-01 -1.91960802e-01
   1.56939174e-01 -2.41222817e-01]
 [ 4.51862331e-01 -2.49103387e-01  7.40350569e-02 -1.76912814e-01
  -3.97876601e-01  5.03413610e-01  6.49879382e-02  1.38112945e-02
   9.66573058e-02 -1.04552173e-01  1.14622578e-01  8.37350220e-02
  -3.73944382e-01  1.11952983e-01 -8.06328902e-02 -2.84558723e-01
   1.81451818e-02  1.57237839e-02]]
[0.52186034 0.16729768 0.10562639 0.0654746  0.05089869 0.02996413
 0.01991366]
In [31]:
Xpca7
Out[31]:
array([[ 3.34162030e-01, -2.19026358e-01,  1.00158417e+00, ...,
         7.93007079e-02, -7.57446693e-01, -9.01124283e-01],
       [-1.59171085e+00, -4.20602982e-01, -3.69033854e-01, ...,
         6.93948582e-01, -5.17161832e-01,  3.78636988e-01],
       [ 3.76932418e+00,  1.95282752e-01,  8.78587404e-02, ...,
         7.31732265e-01,  7.05041037e-01, -3.45837595e-02],
       ...,
       [ 4.80917387e+00, -1.24931049e-03,  5.32333105e-01, ...,
        -1.34423635e+00, -2.17069763e-01,  5.73248962e-01],
       [-3.29409242e+00, -1.00827615e+00, -3.57003198e-01, ...,
         4.27680052e-02, -4.02491279e-01, -2.02405787e-01],
       [-4.76505347e+00,  3.34899728e-01, -5.68136078e-01, ...,
        -5.40510367e-02, -3.35637136e-01,  5.80978683e-02]])
In [32]:
sns.pairplot(pd.DataFrame(Xpca7))
Out[32]:
<seaborn.axisgrid.PairGrid at 0x7f4f5eafb150>

Task 7: Repeat steps 3,4 and 5 but this time, use Principal Components instead of the original data. And the accuracy score should be on the same rows of test data that were used earlier. (hint: set the same random state) (20 marks)

In [33]:
xpca_train, xpca_test, ypca_train, ypca_test = train_test_split(Xpca7, y, test_size=0.3, random_state=42)
In [34]:
clf_pca = svm.SVC(gamma=0.025, C=1000)    
clf_pca.fit(xpca_train , ypca_train)

print("Train Accuracy %0.2f " % (clf_pca.score(xpca_train, ypca_train)))
print("Test Accuracy %0.2f " % (clf_pca.score(xpca_test, ypca_test)))
Train Accuracy 0.99 
Test Accuracy 0.94 

Task 8: Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings (5 points)

In [ ]:
 
In [35]:
print("Accuracy %0.2f " % (svm_classifier.score(x_train, y_train)))
print("Accuracy %0.2f " % (svm_classifier.score(x_test, y_test)))
print("with PCA Train Accuracy %0.2f " % (clf_pca.score(xpca_train, ypca_train)))
print("with PCA Test Accuracy %0.2f " % (clf_pca.score(xpca_test, ypca_test)))
Accuracy 0.98 
Accuracy 0.97 
with PCA Train Accuracy 0.99 
with PCA Test Accuracy 0.94 

Observing the results above, the accuracy score with PCA with dimensianality reduction is good (test accuracy 0.94 ) closer to the accuracy without PCA and dimensionality reduction (test accuracy of 0.97).

In [ ]: